CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K #418

ikawrakow · 2025-05-14T16:33:16Z

This PR is a follow-up to #417 and (almost) completes the quantized matrix multiplication (a.k.a. MMQ) implementation for IQX_K quants. The only one missing is IQ4_KSS, but I don't think I'll do that one as the packing is much too complicated.

There are larger performance gains for IQ2_KS (~35%) than for IQ2_K and IQ3_K (~10%). This is due to IQ2_KS having blocks of 32 and thus being able to use the more efficient GEMM kernel (see discussion in #417).
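To make the block-size argument concrete, here is a minimal CPU-side sketch in C++. It is not the kernel from this PR and not the real IQ2_KS / IQ2_K bit packing: `BlockQuant`, `make_uniform` and `block_dot` are hypothetical names, each block is simplified to int8 values with one float scale, and the sketch assumes activations are quantized in blocks of 32 while the slower weight types use blocks of 16. Under those assumptions it counts how often an integer partial sum has to be converted to float and scaled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified layout: int8 values plus one float scale per block.
// This is NOT the actual IQ2_KS / IQ2_K / IQ3_K packing.
struct BlockQuant {
    int block_size;               // number of values sharing one scale
    std::vector<int8_t> q;        // quantized values
    std::vector<float> scales;    // one scale per block
};

// Build a tensor row where every value is `qval` and every block scale is `scale`.
BlockQuant make_uniform(std::size_t n, int block_size, int8_t qval, float scale) {
    assert(block_size > 0 && n % static_cast<std::size_t>(block_size) == 0);
    return BlockQuant{block_size,
                      std::vector<int8_t>(n, qval),
                      std::vector<float>(n / block_size, scale)};
}

struct DotResult {
    float value;              // the dot product
    int scale_applications;   // how many int partial sums were converted and scaled
};

// Dot product of quantized weights `w` with quantized activations `a`.
// Integer products are accumulated over runs of `step` elements, the longest
// run that shares both a weight scale and an activation scale.
DotResult block_dot(const BlockQuant& w, const BlockQuant& a) {
    assert(w.q.size() == a.q.size());
    const int step = std::min(w.block_size, a.block_size);
    // Assumption of this sketch: one block size divides the other.
    assert(w.block_size % step == 0 && a.block_size % step == 0);

    DotResult r{0.0f, 0};
    for (std::size_t i0 = 0; i0 < w.q.size(); i0 += step) {
        int32_t isum = 0;  // pure integer work inside the run
        for (int j = 0; j < step; ++j) {
            isum += static_cast<int32_t>(w.q[i0 + j]) * static_cast<int32_t>(a.q[i0 + j]);
        }
        // One float conversion and scale product per run.
        r.value += w.scales[i0 / w.block_size] * a.scales[i0 / a.block_size]
                   * static_cast<float>(isum);
        ++r.scale_applications;
    }
    return r;
}
```

With 64 values and activation blocks of 32, weight blocks of 32 need two scale applications while weight blocks of 16 need four for the same result. That doubling of the floating-point bookkeeping per integer accumulation is one plausible reason a kernel specialized for blocks of 32 is cheaper; the actual reasoning is in the discussion in #417.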

The graph illustrates the performance improvements for the same setup as in #417.

Looking at this graph and the graph in #417, I almost feel like adding IQ3_KS and IQ5_KS as 3- and 5-bit quants with blocks of 32.

ubergarm · 2025-05-14T19:24:21Z

Wow, IQ2_KS improved by around 35%!? The _KS variants with blocks of 32 get a nice speedup.

I'd probably try out the larger IQ3_KS and especially IQ5_KS for some mixes in the future if you decide to add them.

Iwan Kawrakow added 5 commits May 14, 2025 15:47

MMQ for iq2_k

92b765d

This works

b44eaaa

MMQ for iq3_k

1acecca

MMQ for iq2_ks

f069a57

Fix iq2_ks

217905c

ikawrakow merged commit 14ed9fb into main May 15, 2025

ikawrakow mentioned this pull request May 15, 2025

Adding IQ5_KS - 5.25 bpw quants #422

Merged
